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Abstract 

f~^ We report (to our knowledge) the first evaluation of Constraint Satisfaction as a computational 

^— I framework for solving closest string problems. We show that careful consideration of symbol occur- 

^D rences can provide search heuristics that provide several orders of magnitude speedup at and above 

vN the optimal distance. We also report (to our knowledge) the first analysis and evaluation - using any 

J>-j technique - of the computational difficulties involved in the identification of all closest strings for a 

given input set. We describe algorithms for web-scale distributed solution of closest stting problems, 
both purely based on AI backtrack search and also hybrid numeric-AI methods. 



I — I 1 Introduction 

l-H 

^ The closest string problem (CSP) takes as input a set of strings of equal length over a fixed alphabet. 

jj A solution is a string with the smallest possible maximum Hamming distance from any input string. 

'— ' (Strictly speaking, distance with respect to any suitable metric can be minimised; the Hamming distance 

,__! is the standard edit distance metric used for this class of problems.) CSP has applications in coding and 

;> information theories (in these fields the problem is also known as minimum radius), but when the input 

strings consist of nucleotide sequences over the letters A, C, G and T, (or of mRNA sequences over the 

Q letters A, C, G and U, or of amino acid sequences over an alphabet of size 20) the CSP has important 

O applications in computational biology (where the problem class is also known as centre string). Examples 

If^ include the identification of consensus patterns in a set of unaligned DNA sequences known to bind a 

O common protein |[T2l . finding conserved secondary structure motifs in unaligned RNA sequences fUM . 

,__! discovering motifs in ranked lists of DNA sequences f6l , finding DNA regulatory motifs within unaligned 

L| noncoding sequences [22], the identification of sister chromatids by DNA template strand sequences [81, 

•'H and DNA motif representation with nucleotide dependency |2|. Our aim is to provide theoretical and 

rS practical results - together with empirical supporting evidence - that lead to improved CSP solution for 

d biological problems, so in this paper the base alphabet £ will always consist of four symbols. 

A Constraint Satisfaction Problems (CSP) consists of a set of constraints involving variables taking 
discrete values. A solution to a CSP is an assignment of values to variables such that no constraint is 
violated. CSP solvers are used for many important classes of problems for which solutions must take 
discrete values, but, to our knowledge, the closest string problem has not been modelled and solved as 
a CSP. The research question under consideration, therefore, is "Is CSP a useful framework for solving 
CSP instances?" 

In this paper we investigate approaches to developing and solving such models. We demonstrate that 
a careful choice of search heuristic can give several orders of magnitude speedup in general. We show 
that CSP modelling and solution are effective tools for the related problem of obtaining all closest strings. 
We consider the distribution of closest string problems across a cloud (or grid, or cluster) of computing 
nodes, and identify two potential super-Unear speedups that can be achieved in practice. Finally we 
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identify the strengths and weaknesses of existing numeric approaches, and suggest hybrid discrete and 
numeric methods that combine the best features of CSP search and numeric search for solutions. 

In the rest of this introduction we discuss existing methods for the CSP with respect to theoretical 
complexity results, give brief overviews of Constraint Satisfaction theory and the Minion CSP solver, 
and formally define the theoretical concepts upon which the research is based. In Section [2] we model 
closest string problems as CSPs, compare search heuristics, and provide results for the all closest string 
problem. We describe distributed algorithms in Section[3] both for pure CSP models and heuristics, and 
for hybrid CSP-numeiic methods. In Section |4] we discuss the relative strengths and limitations of CSP 
as a framework for closest string identification, and identify future avenues of research. 

1.1 Computational Complexity and Existing Methods 

CSP has been shown to be NP-complete for binary strings Q and for alphabets of arbitrary size fT4\i. 
Intuitively there are |r| choices for each of the L positions in any candidate closest string where £ is the 
alphabet, and for any algorithm that fails to check each of this exponential number of cases one could 
devise a CSP for which the algorithm returns an incorrect result. 

Approximate solutions to within (4/3 + e) of the minimal d can be obtained in polynomial time lfT4l 
[T6]| . with several practically useful implementations available, notably those based on genetic algorithms 
|[T3l . However, in this paper we are concerned with first finding exact solutions, and then (given that we 
know the minimal distance d) finding all closest strings that are within d of S. Clearly, an approximate 
method will not, in general, identify the minimal d, and therefore can not be used as a basis for finding 
all solutions. 

Excellent exact results - provided that close bounds have already been identified - have been obtained 
by modelling the CSP as an Integer Programming Problem lITTl . and solving the resulting instances using 
numerical branch and bound methods [15]. This form of search differs from the backtrack search used 
by CSP solvers by having a much less organised search pattern. This is often advantageous, but can be a 
hindrance when searching for all solutions: IP branch and bound is optimised for optimisation, as it were, 
rather than exhaustive search for all candidates for a constant objective function. If the IP formulation 
suggested in IITtII is used, then the feasible region deliberately excludes optimal solutions in order to 
reduce the numbers of variables, in which case no search for all solutions can be made. 

A linear time algorithm exists for solutions to the CSP for fixed distance d [11]. The exponential 
complexity is now in the coefficient, as the method is 0{NL + Ndd'^) where the problem has A^ strings 
of length L. 

1.2 Constraint Satisfaction Problems 

Definition. A Constraint Satisfaction Problem T is a set of constraints '^ acting on a finite set of vari- 
ables A := {Ai,A2, . . . ,A„}, each of which has a finite domain of possible values D,- := D{Ai) C A. A 
solution to T is an instantiation of all of the variables in A such that no constraint in ^ is violated. 

The class of CSP^ is NP-complete as it is a generalisation of propositional satisfiability (SAT). The 
Handbook of Constraint Programming 11211 provides full details of CSP theory and techniques. A key 
observation is that different models (i.e. choices of variables, values and constraints) for the same prob- 
lem (or class of problems) will often give markedly different results when the instances are solved, but, 
as with numeric Linear, Mixed-Integer and Quadratic Programming, there is no general way to decide in 
advance which candidate models and heuristics will lead to faster search. 

A typical solver operates by building a search tree in which the nodes are assignments of values 
to variables, and the edges lead to assignment choices for the next variable. If a constraint is violated 
at any node, then search backtracks. If a leaf is reached, then all constraints are satisfied, and the full 
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set of assignments provides a solution. Tiiese search trees are obviously exponential, and in the worst- 
case scenario every node may have to constructed. However, large-scale pruning of the search tree 
can occur by judicious use of consistency methods. The idea is to do a small amount of extra work 
that (hopefully) identifies variable-value assignments that are already logically ruled out by the current 
choice of assignment, meaning that those branches of the search tree need not be explored. While there 
are no guarantees that this extra work is anything other than an overhead, in practice enough search is 
pruned to give efficient solutions for otherwise intractable problems. Taking a specific example from 
the empirical evaluation reported later in this paper, an all closest string problem for a fixed distance 
involving strings of length 25 with a 4-symbol alphabet will require at most 4^^ ?» 1.1 x 10^^ nodes to 
be searched. An efficient solver will search only 3 or 4 x 10^ nodes, with the remainder being ruled out 
by efficient propagation of the logical results of the assignments during search. Moreover, an efficient 
solver will search around 300,000 of the remaining nodes per CPU second. It is this efficient reduction 
in search space that allows CSP practitioners to solve otherwise intractable problems. 

Heuristics exist for choices of variable-value pair for the next node, and as before these may have 
no effect on the number of nodes visited. Again, in general, variable and value orderings designed for 
specific problem classes can lead to several orders of magnitude reduction in the number of nodes needed 
to find a solution. Standard choices for variable orderings include random, smallest domain, largest 
domain, most-constrained (i.e. chooses avariable that appears in a maximal number of constraints), 
least-constrained, etc. Results will vary with the problem class and model under consideration. Taking 
another specific example from experiments in this paper involving closest string problems, enforcing 
singleton singleton arc consistency - a limited depth procedure that aims to prune entire branches near 
the root, see HI for a full analysis - at the root node of a search tree can be a huge loss. It can take three 
times longer to reach the first closest string at a given distance than it takes to find all closest strings. 

In summary, the solution performance for instances of a class of CSPs will depend crucially on 
choices of model, consistency, search order and the solver used. Moreover, empirical evaluation is often 
the only way to decide which of these choices is better for a given set of circumstances. 

1.2.1 The Minion CSP solver 

The constraint solver Minion ifTOl uses the memory architecture of modern computers to speed up the 
backtrack process compared to other solvers. Minion has an extensive set of constraints, together with 
efficient propagators that enforce consistency levels very rapidly. Minion has been used to solve open 
problems in combinatoric algebra [5|, finding billions of solutions in a search space of size 10^"*' in a 
matter of hours. Minion is used as the solver for this investigation as it offers both fast and scalable 
constraint solving, which are important factors when solving closest string problems. Moreover, the user 
can easily specify bespoke variable orderings, and less easily specify value orderings. 

1.3 Formal Definitions and Results 

Before proceeding to the technical Sections, we first formalise Hamming Distances and Diameters, and 
closest strings: 

Definition. Let Si and S2 be strings of length L over an alphabet Z. Let D be the binary string of length 
L such that 

1 Si{i)^S2{l) 



otherwise 
The Hamming Distance hd{S\ , S2) is defined as the sum from i = \ to L of the D{i). 
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Definition. Let S = {5'i,5'2, . . . ,Sn} be a set of strings of length L over an alphabet Z. A Closest String 
to S is defined as any string CS of length L over £ such that 

hd{CS,Si)<d VrG{l,2,...,A^} 

with d being the minimal such distance for S. The Hamming diameter HD ofS is defined as 

HD{S)=max{hdiSi,Sj)) Mi J G {1,2,...,A^}. 

A solution to a closest string problem involving the strings in S is therefore a string CS and a minimal 
distance d such that each member of S is within d of CS. The Hamming distance is an edit distance 
that quantifies the number of substitutions from £ required to turn one string into another. It is easy to 
show that Hamming distance is a metric, satisfying the triangle inequality. It is clear that the Hamming 
Diameter is an upper bound for the distance of a closest string: a candidate closest string at a greater 
distance can be replaced by any member of S, reducing the maximal distance to HD{S). We can obtain 
a lower bound for the distance of a closest string by observing that the distance can not be less than half 
the Hamming Diameter: 

Lemma 1. Let S = {Si,S2,- ■ ■ ,Sn} be a set of strings of length L over an alphabet £. A closest string 
CS to S must be within \HD{S)/2\ ofS. 

Proof. Let 5, and Sj be two strings from 5 for which the Hamming Diameter is achieved, and let Sj^ be 
any other string of length L over £. By the triangle inequality // (5) =hd{Sj,Sj) <hd{Si,Sk)+hd{Sj,Sk). 
If (without loss of generality) M (5;, 5^) < \HD{S)/2] thenhd{Sj,Sk) > \HD{S)/2]. Hence any distance 
less than \HD{S)/2\ can not be a maximal distance from Sk to any string in S. D 

Search space reduction can be achieved by noting that any value not appearing in position j of any 
of the strings in S need not appear in a closest string solution. It should be noted that this only applies 
when searching for the first optimal solution. When searching for all solutions, any symbol from L can, 
in principle, appear at any position in CS. 

Lemma 2. Let S = {Si,S2, ■■■ ,Sn} be defined as in LemmaU\ LetLjfor j € 1,2, ... ,L denote the subset 
of £ obtained by selecting every symbol that appears in position j of a string in S. Then any symbol in 
position j of a closest string to S must also be in Zy. 

Proof. Suppose symbol 5 in £ \ £y appears in position j of a solution CS. Let CS* be the string consisting 
of CS with s replaced by a symbol from Zy at position j. Then CS* is strictly closer to those strings in S 
with that symbol at that position, and distance to all other strings is unchanged. Hence if the current d is 
optimal for CS, it remains optimal for CS*. D 

The final definition needed for this investigation encapsulates frequencies of symbol appearances per 



string position, and will be used in Section 2.1 to direct backtrack search for closest strings 



Definition. Let S = {Si, S2, ■■■ ,Sn} be a set of strings of length L over an alphabet^. A Position Weight 
Matrix (PWM) /or S is an |r| x L matrix with entries PMSs{i,j) defined as the frequency of symbol i 
appearing at position j in S. 

An example Position Weight Matrix is given in Figure [T] 
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aGgtacTt 
CcAtacgt 
acgtTAgt 
acgtCcAt 
CcgtacgG 



A 30103110 

C 24001400 

G 014 0031 

T 510 14 

Figure 1 : Five strings of length 8 are shown above, with their PWM shown below. 

2 Closest String as a Constraint Satisfaction Problem 



Using the terminology from Sections 1.2 and 1.3 we now construct a CSP instance from a given closest 
string problem. For the purposes of this paper, the alphabet £ consists of the numbers 1, 2 ,3 and 4, 
representing A, C, G and T respectively. Clearly this artificial restriction can easily be relaxed in order 
to model arbitrary alphabets. 

Given S, a set of A^ strings of length L over alphabet £, we first compute the Hamming Diameter 
HD{S) and use this to provide a lower bound, dmi„, for the optimal distance d, as shown in Lemma [T] 
T{S,dmin,HD{S)) denotes the CS'/' instance in which the set of variables is A := Ai UA2UA3UA4, where 

1. Ai is the array [CSi,CS2, ■ ■ ■ ,CSl] of variables representing the closest string, each such variable 
having domain 1 through 4 

2. A2 is an A/^ X L array of binary variables used to calculate Hamming Distances from Ai to the input 
strings S 

3. A3 is the aiTay [Di ,D2, . . . ,Di\i] of variables representing the distance of each string in S to the 
current CS candidate, each such variable having domain d,„in through HD(S) 

4. A4 is the single distance variable d with domain dmin through HD{S). 
The constraints are: 

1. A2{i,j)=0iffSi{j)=A,{j) 

2. A3 {k) is the sum of row k of A2 

3. A4 is the maximum value appearing in A3 
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4. A4 is minimised: if a solution is found with A4 = d, search for another solution with A^ = d — 1 
(unless d = dmin)- 

Ai are the search variables: nodes of the search tree consist of values assigned to these variables. A4 is 
the objective function (or cost function). A returned solution is A] U A4, a closest string together with 
the optimal distance. Solving T{S,d,„in,HD{S)) is guaranteed to return a solution, although it is not 
impossible that all 4^ nodes are visited for every current minimal d. Restricting the domains of A3 will 
save computational effort when a solution is found with d = dmin and will have no effect otherwise. 
Restricting the domains of the Ai variables in line with lemma [2] also reduces the search space, although 
the restrictions can not apply when searching for all solutions. 

To find all closest strings T(5, dmm,HD{S)) is solved to obtain CS and dgpt- By restricting the domains 
of A3 to dmin through dopt and removing the optimisation constraint we obtain a new CSP T* {S, dmin , d) 
which can be solved for all solutions. The search undertaken to find the first solution CS need not be 
repeated: constraints can be added that rule out those parts of the search tree already processed. It should 
also be noted that CS and d need not be obtained using the Constraint Satisfaction approach: any method 
that returns an optimal solution can be used to create an all closest strings CSP. 

2.1 Position Weight Matrix Variable and Value Ordering 

We now use results from computational biology to devise a bespoke variable and value ordering schema 



for T(5, dmin ) ■ By precalculating a Position Weight Matrix for S as defined in Section 1 .3 we can order the 
search variables by maximum frequency. For each variable, we order the values assigned during search 
by decreasing relative frequency. Tie breaks are either random or by least index. In the example given as 
Figure[T]the variable ordering by position 1 through 8 would be 1: position 4 (having 5 occurrences), 2 - 
5: positions 2, 3, 6, and 8 in any order (each having 4 occurrences), 6-8: positions 1 and 7 in any order 
(having the least highest frequency of 3). The value ordering for position 5 in the figure would be 1: A 
(most frequent), 2-3: T and C in any order, 4: G (least frequent). By Lemma |2} when seeking a single 
solution it is safe to exclude values that don't appear at a given position from their respective variable 
domains before search. Hence in for the example in Figure [T] the value ordering would be values typeset 
in blue followed by values typeset in red, with black values excluded. 

The idea behind this search heuristic is that search starts close to (in the sense of maximum likeli- 
hood) an optimal solution. Only if no such solution is found does search progress to less likely (but not 
impossible) parts of the search tree . 

2.2 Comparison of Search Heuristics 

In this Section we test the hypothesis that PMS-based search heuristics reduce the search needed for 
solutions to T{S,dnnn) CSP instances when compared to a standard heuristic. Figure [2] illustrates the 
results from 200 closest string problems. Each problem was run first with smallest domain variable 
ordering and ascending value ordering (Minion defaults), and then with PWM-based variable and value 
ordering as described in Section [2?T] 

For exact solutions - upper panel of Figure [2]- we observe an improvement of PWM over SDF in 
almost all cases. The speedup is as high as several orders of magnitude in some cases. The difference is 
statistically significant at the 0.001 level. These results are as expected: the PWM reflects the maximum 
likelihood of a closest string, so a search that respects these likelihoods will nearly always be highly 
efficient, but will visit very many non-essential nodes on the few occasions that the maximum likelihood 
does not lead to a closest string. A key observation is that the magnitude of speedup increases with 
increasing string length, which is highly encouraging since the complexity of closest sting is exponential 
in string length. 
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Figure 2: Empirical data from 200 instances of 3, 4, 5 and 6 strings of lengths 10, 15, 20, 25 and 30. 
For each combination of parameters, 10 random instances were generated with results summarised in the 
boxes which show median values (thick line), 25th-75th percentiles (boxed) and Oth-lOOth percentiles 
(dashed lines). In the top panel we compare the exact optimal solution times. In the lower panel we show 
the times taken to obtain an optimal result, omitting the time needed to certificate that result. In both 
figures the y axis shows the speedup of Position Weight Matrix over Smallest Domain First ordering on 
a logarithmic scale, and the times given below the string lengths are the median CPU time taken over 
all strings of that length. The experiments were conducted on a dual quad-core 2.66 GHz Intel X-5430 
processor with 16 GB of RAM. 
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If we only consider the search effort needed to find an optimal solution (not taking into account the 
work needed to provide a certificate of optimality by ruling out closer strings at lower distance) then 
the speedup of PWM over smallest domain is at the level of orders of magnitude in the general case - 
Figure |2] lower panel. This indicates that heuristics are less important when searching exhaustively at a 
lower than optimal distance: most of the practical complexity of closest string search is associated with 
providing certificates of optimality, rather than identifying close strings which turn out to be optimal. 

times optimal distance 




Figure 3: Convergence towards optimal Hamming distance. The upper line is the maximum relative 
distance, the lower line the minimum and the middle line the mean of the 200 experiments performed. 
The y axis denotes multiples of the optimal Hamming distance, the x axis denotes CPU time for the 
PWM heuristic on a logarithmic scale. The experiments were conducted on a dual quad-core 2.66 GHz 
Intel X-5430 processor with 8 GB of RAM. 



Figure [3] shows that we achieve a good approximation very quickly, in line with existing results that 
guarantee approximation to four thirds of optimality in polynomial time fW.T^. This motivates the hy- 



brid symbolic-numeric methods detailed in Section 3.4 practitioners can use CSP to obtain good bounds 
quickly, then use either numeric methods or AI search methods - or indeed both using a distributed 
architecture - to explore the remaining search space for an exact solution plus certificate of optimality. 
Taken together the results indicate that: 

1 . CSP search with PWM variable- value ordering will (in general) efficiently find candidate solutions 
to closest string problems with decreasing maximum distance d 

2. CSP search with any sensible search ordering can be used to exhaustively rule out the distance 
below the optimal d 

3. our empirical evidence is in line with previously reported results: an approximate solution to 
closest string can be computed in polynomial time, but computation of the necessary certificates 
of optimality remains intractable in the general case 

4. sequential, single-processor CSP search for problems having more strings of greater length (and 
possibly a larger alphabet) will become intractable due to the inherent NP-completeness of closest 
string. 
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Figure 4: Empirical data from 6 instances of 5 randomly generated strings of length 25. SDF and PWM 
indicate smallest domain and position weight heuristics respectively. All timings were calculated using 
a dual quad-core 2.66 GHz Intel X-5430 processor with 16 GB of RAM. 



2.3 All Closest Strings 

To our knowledge, no study has investigated the problem of finding all closest strings for a given set S. 
This may be due to the additional computational complexity involved: it is hard enough to find single ex- 
act closest strings without performing a systematic search for all such strings having the same maximum 
distance from S. It may also be the case that the problem is not interesting: the important information 
in a CSP solution being the distance returned, with the closest string being merely an exemplar at that 
distance. It seems likely, however, that knowledge of how much self similarity an input set has - rather 
than just the degree of self similarity - could be useful information in sequence analysis. 

Despite this uncertainty, we wish to investigate the effect that modelling has on the set of all solutions. 
In our CSP model described in Section[2]we reduce the search space for a first solution by forbidding any 
variable to take a value that is not present at that position in one of the input strings. Similar restrictions 
were made by Meneses et al. when formulating CSP as an Integer Programming problem lITTll . The 
questions are: 

1. Are many otherwise closest strings ruled out by these restrictions? 

2. How much extra computational effort is required to identify each and every closest string? 

In Figure |4] we show the results of sample calculations for six instances of 5 randomly generated 
strings of length 25 . We see that in all cases (columns headed PWM Restricted Domains) it is relatively 
easy using Minion to identify all closest strings if we restrict search to those alphabet symbols that have 
non-zero values in the position weight matrix for the instance. We also find that search using PMW 
ordering heuristics performs marginally better than straightforward smallest domain heuristics. 

When the variable-value assignments that have been ruled out because the value does not appear 
at the variable's position in any element of 5 are added back to the domains of the search variables, 
we can perform full search for all closest strings (Figure |4} columns headed by Unrestricted Domains). 
The percentage of new closest strings found ranges from 0% to 22%, but increase in search required is 
typically two orders of magnitude. It should be noted however that: 

1 . Minion is searching far fewer than the 4^^ possible search nodes for each instance, the majority be- 
ing pruned by efficient propagation of the logical consequences of the variable- value assignments 
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implicit at each node, and 

2. Minion is searching 250,000 - 300,000 nodes per second in addition to the work involved in 
identifying search sub-trees that need not be explored. 

3 Distributed and Hybrid Computing Strategies 



Input : A CSP T, a cutoff period T^ax and a branching factor K 
Output: Either the first solution, or a guarantee that there are no solutions 

while not Solved? (T) do 
Send T to a node 

Run solver with input T for < f < T^ax 
if Solved? (T) then 
Return solution 
else 

T -^ T with new constraints ruling out search already performed 
Split T into K subproblems Ti ,T2, . . . ,Tjf 
do in parallel 

for 1 < ;t < /«: do 

I Solve(T^, r^^,^) 
end 
end 
end 
end 

Algorithm 1: A recursive distributed algorithm to solve any CSP 



3.1 Distributed CSP 

Given the inherently exponential increase in search effort involved in providing a certificate for an opti- 
mal closest string distance by ruling out any closest strings with with lower distance, the exact solution 
of large-scale problems is not expected to be tractable using purely sequential search. In this Section we 
describe algorithms that distribute search across multiple compute nodes. These algorithms will solve 
closest string problems either on a cluster (a local group of homogeneous nodes), a grid (a more loosely 
coupled, heterogeneous and geographically dispersed set of nodes), or a cloud (a set of an unknown 
number of nodes in unknown locations, each having unknown architecture and resource). Generally 
speaking, a cluster is more controllable but smaller than a cloud, with a grid being either the best or 
worst of both worlds, depending on one's point of view. For our purposes we do not require any com- 
munication across nodes (although computational efficiencies could be obtained if that were the case), 
and can therefore treat the three distributed paradigms as a single approach. The only disadvantage to 
using a cloud is that empirical evaluation is often impossible since the times reported in virtual machines 
are not reliable. This is because clocks of virtual machines can be slowed down or sped up by the VM 
management software. We therefore prototype our computational methodology on a cluster, and, when 
satisfied that it is efficient, deploy using a cloud to take advantage of the very large number of nodes 
available. 

Algorithm [T] gives the basic structure of our distributed search. The predicate Solved? returns true 
whenever the input CSP finds the first solution or finishes searching the entire tree without finding a 
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solution. It returns false if either the computation has timed out, or the node has suddenly stopped 
working for some reason. If all solutions to the input CSP are required, then we modify Algorithm [T] so 
that all solutions found so far are returned whenever the Solved? predicate fails. 

It should be stressed that Algorithm[T]is not a contribution to the results of this paper. The algorithm 
has been implemented, tested, optimised and deployed on clusters, grids and clouds. It has been - and is 
being - used to attack CSP instances requiring an estimated 200 CPU years for exact solution lH. 

3.1.1 Distributed CSPs Using Minion 

In common with Integer Programming problems, CSPs distribute naturally across multiple compute 
nodes [J |. Significant research has been invested in the distribution of CSPs across multiple computers ||3l 
|25l[T8l. In particular the area of balancing the load among the nodes is an area of active research |[20l . 

Instead of the more sophisticated approaches, we choose a simple technique that does not impose 
any constraints on the problem to be solved and is targeted towards very large problems. Our algorithm 
closely follows Algorithm [T] - we run Minion with a time out and when this time out is reached, we 
split the remaining search space into two parts. The subproblems are inserted into a FIFO queue and 
processed by the computational nodes, splitting them again if necessary. 

One of the drawbacks of our approach is that it does not parallelise small problems well. For n 
compute nodes, we only achieve full capacity utilisation after log2n splits, i.e. after timeout x log2« 
seconds. We do not consider this to be a limiting factor however because the split timeout can be adapted 
dynamically to at first quickly split the problem and when full utilisation has been achieved increase it. 
For the large problems we have focused on when implementing this technique, requiring days or even 
years of CPU time, this is not a limiting factor. 

The main advantage of our way of distributing problems over other approaches is that we explicitly 
keep the split subproblems in files. This means that at any point we can stop, suspend, resume, move 
or cancel the computation and lose a maximum of timeout x n seconds of work, much less in practice. 
Apart from contributing to the robustness of the overall system, we can also easily move subproblems 
that cannot be solved using the available computational resources, for example because of memory lim- 
itations, to nodes with a higher specification that are not always available to us. 

In the absence of global symmetry breaking constraints that can affect different parts of the search 
tree, it is easy to subdivide a typical CSP into several non-overlapping sub-problems. Although there is 
an inherent latency in sending problem instances to, and receiving solutions from, either a grid or a cloud, 
for large enough problems a speedup linear in the number of compute nodes is achieved. Recent results 
using a computational grid indicate that a super-linear speedup can be achieved using Minion, whenever 
a root node consistency check reduces the search tree [4]. There is no guarantee of this, however, since 
root consistency checks are heuristics that will at times provide no benefit for the extra work involved. 

Cloud computing is becoming an important computational paradigm, and the Minion developers have 
produced robust, fault-tolerant, methods for distributing Minion instances across different underlying ar- 
chitectures, including clouds. By leveraging existing technologies, in particular the Condor distributed 
computing framework ll23l . we can distribute problems across hundreds of CPUs and combine clus- 
ter, grid and cloud architectures for web-scale computing. This enables us to tackle problems which 
have previously been thought to be unsolvable because of the amount of computation required to find a 
solution. 

3.2 Distributed Closest String 

Algorithm [2] describes our approach to the distributed solution of closest string problems formalised as 
Constraint Satisfaction problems. We first run Minion on the original problem with the PWM ordering 
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Input : T(5, d,nm , HD{S) ) , T,„ax and K 

Output: A closest string to S with its maximum Hamming distance to S 

for < f < T,na, do 

RunT {S,d,„i„,HD{S)) in Minion 
if Solved? (.T{S,dmin,HD{S))) then 

Return CS and d, and halt all computation 

else dhigh ■^ the best d found so far 

T{S,d,„in,HD{S)) ^ T{S,dmin, dhigh) plus constraints ruling out search already 

performed 
end 
end 
do in parallel 

for dmin < diow < dhigh do 

DistSolve(T*(5,<i,„,-„,<i/o„,), T,nax, K) 

if Solved? (J* {S, dmin, diow)) then 

Return CS and d = diow, and halt all computation 
else Update all (sub-)instances with new lower bound diow + 1 
end 
end 
for2<yc<ii:+ldo 

DhtSolve{Tk{S, dmin,dio„), T^ax, K) 

if Solved? (J k{S,d,nin, dhigh)) with dnew < dhigh then 

if dnew = dhigh — 1 then Return CS and d = dmw, and halt all computation 
else Update all (sub-)instances with upper bound d^gh = dnew 
end 
end 
end 
iSnot Solved?( Tk{S,dmin-,dhigh) V /c) A not Solved? {any fixed diow instance) then 

Return current dhigh as d, and the string found that achieved distance dhigh as CS 
end 
Algorithm 2: Solve the CSP T{S,dmin,HD{S)) by distributing search for high and low distances 



heuristic as a single process. Our empirical evaluation in Section 2.2 indicates that nearly always this 
process will highly efficiently lower the upper bound for the problem. Once we have a reasonable upper 
bound, we start searching for the optimal distance both above and below. From above, we carry on 
optimising as before, but we use the recursive DistSolve algorithm to distribute. From below we create 
instances each having a fixed distance, the idea being to exhaustively rule out any closest strings at these 
distances. These instances are run on the computational nodes at the same time as the optimisation sub- 
problems. If at any stage we obtain a candidate closest string at a distance for which all lower distances 
have been ruled out, then this is our solution. This can happen both from above and below. 



As mentioned in Section 3.1.1 we expect a super- linear speedup by performing a root node consis- 
tency check for each sub-instance. By keeping track of the best distance obtained so far during search 
from above, and of any lower distances for which no solution has been found, we expect to obtain a 
further super-linear speedup in the majority of instances. A large part of the search tree is pruned by 
updating all instances (either waiting for input to a node, or currently being processed by a node) with 
improved distance bounds as they become available. 
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3.3 Preliminary Evaluation of Distributed Closest String 

For a first evaluation, we ran the algorithm on 6 random strings of lengths 25, 26, 27, 28, 29 and 30. We 
chose a time limit of 1 hour to reduce communication overheads. The problems with strings of length 
25, 26 and 27 were solved to completion within this limit. 

The remaining three instances were split after one hour and distributed across multiple machines. As 
suggested by Figure [3] the solutions converged towards the optimal distance extremely quickly. For only 
one of the instances was a better Hamming distance found in one of the sub-instances. The remaining 
sub-instances proved the optimality of the previously found solution. 

These tests demonstrate the practical applicability of our distributed approach. We have not per- 
formed a large-scale evaluation, nor have we obtained evidence for the super-linear speedups associated 
with bounds updates and an increased number of consistency checks at the root of sub-instances. Our 
experience with the distributed solution of other classes of CSP suggests that our system will scale 
seamlessly to grids or clouds containing an essentially unlimited number of compute nodes: there is no 
communication across nodes, a node failure can be recovered from with no extra search needed (the 
search tree already explored is reported whenever search is interrupted for any reason), and the order in 
which sub-instances are solved can be tuned. 

3.4 Hybrid Methods 

Input ■.T^{S,d,„in,HD{S)) 

TOL, a limit for the gap between the highest and lowest computed distances 
Output: A closest string to S with its maximum Hamming distance to S 

Seek closer distance bounds for T" (5, dmin , HD{S) ) using CSP alone; 
while \dhigh — diow\ < TOL do 

Run Algorithm|2|on T^{S,d,„i„,HD{S)) 

Output diow and (i/,,^/, when updated 
end 
Once bounds are close enough, send to numeric IP or linear time search; 

if TOL > 1 A \dingh - di„,^,\ < TOL then 

Formulate the remaining problem as an Integer Programming problem 

Search for solution using numeric branch and bound 
end 

if \dhigh — diow\ = TOL = 1 then 

Formulate the remaining problem as a fixed d instance 

Search for solution using Hnear time methods 
end 
Algorithm 3: Solve the CSP T{S,dmin,HL){S)) using hybrid CSP and numerical methods 

The empirical results obtained so far suggest that CSP formulation with PWM ordering is an effec- 
tive approach for ruling out high distances: Minion will often find a first solution very quickly, given the 
search space involved. However, at least for the approach suggested in this paper, CSP formulation re- 
quires much more time to provide a certificate that an optimal solution is indeed optimal. As discussed in 
Section [TTT] efficient methods have been reported in the literature for when the upper and lower distance 
bounds are close ifTTl . and for problems where the distance is fixed ifTTI . In this Section we propose a 
hybrid approach that aims to take advantage of the best methods available. Algorithm [3] takes a closest 
string instance and partially solves it using Algorithm [T] If the upper and lower bounds come to within a 
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pre-defined tolerance, then numeric branch and bound methods are used to solve an Integer Programming 
formulation of the problem not yet solved by Minion. If the distance under consideration ever becomes 
fixed, then the linear time methods set out in 1 1 1 1 can be applied. 

It should be stressed that these three methods (CSP, IP branch and bound, and linear time) need 
not be exclusive: once tolerance achieving bounds are found by Minion, the distributed Minion search 
can continue, and the CSP and numeric methods are then competing to find the first solution. This of 
course pre-supposes that computational resource is not a problem, but that is why we are using web-scale 
facilities in the first place. 

4 Discussion 

We have performed (to our knowledge) the first evaluation of Constraint Satisfaction as a computational 
framework for solving closest string problems. We have shown that careful consideration of symbol 
occurrences can provide search heuristics that give, in general, several orders of magnitude speedup 
when computing approximate solutions. We have also shown that CSP is less effective when searching 
for certificates of distance optimality. This result motivated our detailed description of algorithms for 
web-scale distributed CSP computation, and also our design of hybrid distributed algorithms that can 
take advantage of the strengths of both numeric and CSP computational techniques. 

We have also performed (to our knowledge) the first analysis of the computational difficulties in- 
volved in the identification of all closest strings for a given input set, irrespective of the computational 
framework used. Our results for all closest strings motivate the question of which definition of self- 
similarity is suitable for the computational biology setting. In terms of information theory the all closest 
strings problem can not exclude alphabet symbols and still be correct. However, when seeking to quan- 
tify the self-similarity of DNA sequences it may be perfectly justifiable to exclude closest strings that can 
have no symbol in common with the sequences in question at a given point. If this were to be the case, 
then the computational efficiency of the search for all closest strings would be greatly increased (Figure 

We have designed, implemented and deployed a computational methodology for distributed search 
for closest string solutions. This contribution provides a practically useful means of attacking the NP- 
complete instances by division into smaller sub-problems. Our system is guaranteed never to perform the 
same search twice, will recover seamlessly from any unforseen loss of compute nodes, and is extendable 
to web-scale clouds. 

The limitations of this study are that we have not been able to compare numeric solutions to CSP 
solutions directly, (nor assess the hybrid numeric-C5P algorithm described in the paper), and that we have 
not attacked real world problems in a distributed setting, instead solving randomly-generated instances. 
We have outlined the possibility of super-linear speedups for distributed search, but present no supporting 
evidence as our distributed implementation has as yet been tested solely for accuracy, scalability and 
robustness. 

4.1 Future Work 

Possible future avenues of research include 

• Performing full-scale cloud searches for solutions to real-world closest string problems (rather 
than concentrating on randomly-generated problem instances as for this paper) 

• The provision of a fully distributed IP branch and bound solver for use in Algorithm [3] 
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• Experimentation with the directed graph CSP formulation described by Meneses et al. fTT| to 
improve their IP formulation as an alternative CSP model for closest strings 

• Investigating other NP-hard string sequence problems such as closest substring, farthest string and 
^-mismatch - an obvious candidate is consensus string, which differs from closest string only in 
that the objective is to minimise the sum of the distances, rather than minimise the maximum 
individual distance 

• Search for more complicated metrics than Hamming distance that better capture the concepts of 
difference and similarity for nucleotide sequences - recent results involving Markov models 1 241 
suggest that judicious choice of metric has profound implications for both the theory and practice 
of identifying self-similarity amongst sequences. 
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